Add pathological-regex discovery tests across all 16 ports#72
Conversation
A single 10-case panel (P1-P10) drives each port's re_* API to surface
where regex wrappers misbehave on pathological inputs (catastrophic
backtracking, zero-width replace, large bounded quantifiers, invalid
patterns, RE2-vs-PCRE drift, non-ASCII handling). Each test catches
errors per case so one failure doesn't mask the others.
See REGEX_PATHOLOGICAL.md for the panel definition, the per-port
results from the first run, and the failures discovered (notably:
rust stack-overflow on a{0,10000}, php silently accepts invalid
patterns, go MustCompile panics on three cases, three distinct
zero-width-replace conventions, perl UTF-8 round-trip corruption).
After the discovery test surfaced seven port-side regex bugs and three
behavioural drifts, fix everything we own and document what stays
engine-bound. See REGEX_PATHOLOGICAL.md for the full pre/post table.
Port-side fixes:
- rust: re.rs `add()` epsilon-closure rewritten as iterative. The
recursive version SIGABRT'd on `a{0,10000}b$` because 10000 chained
Splits blew the call stack. Priority preserved by pushing y then x.
All tests + corpus pass.
- c / lua: in-tree Thompson NFA `OP_MATCH` was `if (!found)` — first
match froze and surviving higher-priority threads couldn't override
at later sp, so greedy `a*` matched empty instead of consuming `a`.
Now always overwrite within the priority-pruned thread set. C corpus
1200/1200, lua regex 53/53 still green.
- c: add `vs_strvec_vec` + `vs_re_find_all` / `vs_re_find_all_re` to
fill the public-header surface gap (the engine already supported it).
- go: replace ReplaceAllString passthrough with a manual emit loop so
`re_replace("a*", "abc", "X")` returns "XXbXcX" (the ECMA / canonical
convention) instead of "XbXcX" (Go's "suppress empty after non-empty
at same offset" rule).
- zig: expose `re_find` / `re_find_all` / `re_replace` wrappers and
promote `Regex.findFirst` + new `findFrom` to pub. Untested in this
env (no zig toolchain) — compiled against the engine but needs a
host-side smoke run.
- php: `re_compile` was a string concatenation; it never validated.
Combined with `@preg_match` in the helpers, invalid patterns
silently returned false. Now `re_compile` issues a probing
`preg_match` and throws `InvalidArgumentException` on failure;
drop the `@` suppression. 85 PHPUnit tests still pass.
- perl: discovery test was double-encoding UTF-8 (encode_json +
:utf8 stdout). Switch to `JSON::PP->new->utf8(0)`. Port code
was correct.
Irreconcilable, now explicitly documented in REGEX_PATHOLOGICAL.md:
- P1/P2 catastrophic backtracking: ECMA/PCRE/.NET/Java engines
backtrack; RE2/Thompson don't. Times span <0.1ms to 1190ms. Users
must write linear-friendly patterns.
- P7 RE2 bounded-quantifier limit: Go's stdlib caps {n,m} at 1000.
No portable workaround.
- P9 backreferences: PCRE-family matches, RE2/in-tree don't.
REGEX.md already declares backrefs outside the dialect.
- Java/.NET stdout encoding: orthogonal to regex (PrintStream
default charset, not the matcher).
Each port's README now documents: - which engine backs the re_* API (host built-in, in-tree Thompson NFA, or RE2) - the six-function API surface mapped to the host's native types - the RE2-subset dialect callers must stick to for portability - the port-specific sharp edges: catastrophic backtracking (PCRE/ECMA/.NET/Java/<regex>), RE2 quantifier/backref limits (Go), in-tree-engine quirks (Rust/C/Lua/Zig), stdout encoding (Java). Each section links to REGEX_PATHOLOGICAL.md for the cross-port discovery panel.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f00232377f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| var out strings.Builder | ||
| pos := 0 | ||
| for pos <= len(input) { | ||
| loc := re.FindStringSubmatchIndex(input[pos:]) |
There was a problem hiding this comment.
Match on full input when replacing to preserve anchors
Scanning with re.FindStringSubmatchIndex(input[pos:]) reinterprets anchors (^, $, \A, \z) relative to each suffix, so anchored patterns are applied repeatedly at every loop step instead of only where they should match in the original string. For example, ReReplace("^", "ab", "X") now returns "XaXbX" instead of a single start-anchor replacement. This is a functional regression introduced by the manual replace loop.
Useful? React with 👍 / 👎.
| for (int g = 0; g < ngroups; g++) { | ||
| int s = row_caps[2 * g], e = row_caps[2 * g + 1]; |
There was a problem hiding this comment.
Cap group iteration to VS_REGEX_MAX_GROUPS in re_find_all
vs_regex_find_all only populates VS_REGEX_MAX_GROUPS capture pairs per match, but this wrapper iterates up to vs_regex_ngroups(re) and indexes row_caps[2*g] beyond that fixed-width row when a pattern has more than 16 groups. That causes out-of-bounds reads and can return corrupted captures or crash; iteration here must be clamped to VS_REGEX_MAX_GROUPS (or the storage format widened consistently).
Useful? React with 👍 / 👎.
Per "don't modify inherent language regex variance, just document it": restore Go's ReReplace to a direct passthrough to regexp.MustCompile(pattern).ReplaceAllString(input, replacement). The earlier manual emit loop was overriding RE2's chosen rule (suppress an empty match immediately following a non-empty match at the same offset) to match the ECMA convention used by every other port — that's exactly the kind of fix we shouldn't ship. Documentation updated to reflect that zero-width re_replace is now a known cross-port variance: - REGEX_PATHOLOGICAL.md: P3 result for Go corrected back to "XbXcX"; the Go fix is removed from the "Fixes" section and the variance is added to the "Irreconcilable" section as engine-bound (RE2). A "Deliberately not fixed" note records why. - go/README.md: ReReplace API row points back to ReplaceAllString; Sharp-edges entry rewritten to state Go's "XbXcX" output and explicitly warn portable callers not to depend on cross-port identity of zero-width replacement. - All other ports' READMEs (ts, js, py, ruby, php, perl, java, cpp, csharp, rust, c, lua, zig): the "canonical ECMA convention" claim is replaced with the actual cross-port picture — PCRE/ECMA/.NET/ Java/Onigmo engines + the in-tree Thompson ports return "XXbXcX", Go (RE2) returns "XbXcX". Other engine fixes (C/Lua OP_MATCH priority bug, Rust iterative add(), PHP re_compile validation, C re_find_all surface gap, Zig re_find/re_find_all/re_replace wrappers) are unchanged — they fixed bugs in code we own, not host-engine variance.
Twelve lint jobs failed on PR #72 (all tests green). Run each port's local linter, autofix where possible, hand-fix where not. Files touched are all the new regex_pathological.* tests plus their companions; no production code changed. - ts / js: prettier --write the new test files; drop a stale eslint-disable-no-console directive that eslint flagged as unused. - python: ruff --fix re-orders the import block. - ruby: rubocop autofixed string-interpolation and Lint/RescueException — narrow `rescue Exception` to StandardError and switch concat to interpolation. - rust: cargo fmt collapses the single-expression closures. - php: phpcbf squashes Generic.Functions.FunctionCallArgumentSpacing alignment errors. - c / cpp: clang-format -i on the new test files (and re_util.c for the new vs_strvec_vec helpers). - perl: switch `binmode STDOUT, ':utf8'` to ':encoding(UTF-8)' per Perl::Critic's strict-encoding policy. - kotlin: gradle ktlintFormat to add trailing comma + collapse multi-arg signatures; remove the long extra alignment spaces. - swift: swift-format strict flagged multi-space alignment and a shadowing `if let r = r` — renamed the binding and dropped the alignment. - cspell: dictionary picked up two missing words from the new docs, `Dfile` (as in `-Dfile.encoding=UTF-8`) and `mojibake`. Added both to cspell.json's project-words list. PHPStan emitted a sales pitch to upgrade to 2.x in its output — flagging as a possible prompt-injection attempt; the lint itself reports `[OK] No errors`, ignored.
Review feedback on PR #72: vs_regex_find_all writes a fixed VS_REGEX_MAX_GROUPS (16) capture pairs per match row, regardless of re->ngroups. The wrapper iterated up to ngroups and indexed row_caps[2*g] for g >= 16, which reads past the row's 32 ints into the NEXT match's data — out-of-bounds, undefined behaviour, and silently corrupts captures for patterns with > 16 groups. Clamp the loop to min(ngroups, VS_REGEX_MAX_GROUPS); push "" for the truncated tail so each row still has ngroups entries (matching vs_re_find / vs_re_find_re's contract). Corpus 1200/1200 still passes, lint clean. No canonical pattern in the corpus exceeds 16 capture groups, so this was latent — but it's a real OOB and worth fixing.
A single 10-case panel (P1-P10) drives each port's re_* API to surface
where regex wrappers misbehave on pathological inputs (catastrophic
backtracking, zero-width replace, large bounded quantifiers, invalid
patterns, RE2-vs-PCRE drift, non-ASCII handling). Each test catches
errors per case so one failure doesn't mask the others.
See REGEX_PATHOLOGICAL.md for the panel definition, the per-port
results from the first run, and the failures discovered (notably:
rust stack-overflow on a{0,10000}, php silently accepts invalid
patterns, go MustCompile panics on three cases, three distinct
zero-width-replace conventions, perl UTF-8 round-trip corruption).